{
"cells": [
{
"cell_type": "markdown",
"id": "dc4fdf34-07dd-4cc2-8351-69fb6f5362f0",
"metadata": {},
"source": [
"## 9 - Confirmatory Factor Analysis\n",
"### Fitting latent variables onto data"
]
},
{
"cell_type": "markdown",
"id": "12f1d4ae-011a-4aa5-823c-eb8ced9025ae",
"metadata": {},
"source": [
"We continue our journey in multivariate data analysis with the sister technique of exploratory factor analysis - **confirmatory factor analysis** (CFA). Much like EFA, CFA is a framework that draws no distinction between what variables are predictors or outcomes. But unlike EFA, the aim of CFA is to instead a-priori specify a set of latent variables and how some observed data might have given rise to them, and see how well that model specification fits the data. \n",
"\n",
"That distinction is very important. While EFA will find a number of latent variables that best explain the data you have collected, in CFA it is **your job** to specify the number of latent variables *and* their relationships with the data you have collected to see how well this model fits - remember that EFA does that for you by computing the loadings of each observed variable on a latent variable, and we can infer what is associated with what from that. In CFA, we have to suggest, up-front, what latent variable is associated with what observed variables, and whether the latent variables are associated themselves! Note these are all linear models as we have seen before, but now on a more complex and interwoven scale.\n",
"\n",
"How do we come up with these CFA models, determining what is associated with what? This is why EFA and CFA are so interlinked. Theory and EFA can guide us in figuring out what is associated with what, and we can take new data and specify a CFA model to see how well it does. Thus, CFA often follows EFA in modern research practice.\n",
"\n",
"#### Fitting CFA models in Python - `semopy`\n",
"To fit these models in Python, we will use a package called `semopy`. This package is a general-purpose modelling package built specifically for another type of use case - structural equation models (hence the `sem`), of which CFA is actually a specific type. It has a familiar but slightly different syntax for fitting models, as we will explore below. First, lets import it and everything else we've used so far."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c9e064d8-bd31-4778-a21a-ab2202090709",
"metadata": {},
"outputs": [],
"source": [
"# Import what we need\n",
"import pandas as pd # dataframes\n",
"import seaborn as sns # plots\n",
"import statsmodels.formula.api as smf # Models\n",
"import marginaleffects as me # marginal effects\n",
"import numpy as np # numpy for some functions\n",
"import pingouin as pg\n",
"from factor_analyzer import FactorAnalyzer # for EFA\n",
"import semopy as sem\n"
]
},
{
"cell_type": "markdown",
"id": "97be41da-1eec-4896-aa32-8496efc69212",
"metadata": {},
"source": [
"To get started, let's read in some data for which we have a sense of what the latent variables might be. We'll rely again on a personality measure. In the \"nerdiness scale\" dataset, there is actually a useful, very short measure of the Big Five, called the TIPI - the Ten Item Personality Inventory. Download the data here: http://openpsychometrics.org/_rawdata/NPAS-data-16December2018.zip\n",
"\n",
"I've saved it as `data_nerd.csv` already, so below it is read in, and we extract the columns with TIPI in the column name."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e5201e9c-cf23-462e-95d0-ec7505c80138",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
TIPI1
\n",
"
TIPI2
\n",
"
TIPI3
\n",
"
TIPI4
\n",
"
TIPI5
\n",
"
TIPI6
\n",
"
TIPI7
\n",
"
TIPI8
\n",
"
TIPI9
\n",
"
TIPI10
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1
\n",
"
1
\n",
"
7
\n",
"
1
\n",
"
6
\n",
"
7
\n",
"
7
\n",
"
1
\n",
"
7
\n",
"
1
\n",
"
\n",
"
\n",
"
1
\n",
"
5
\n",
"
3
\n",
"
5
\n",
"
4
\n",
"
5
\n",
"
5
\n",
"
5
\n",
"
5
\n",
"
6
\n",
"
1
\n",
"
\n",
"
\n",
"
2
\n",
"
1
\n",
"
6
\n",
"
3
\n",
"
7
\n",
"
6
\n",
"
7
\n",
"
2
\n",
"
6
\n",
"
2
\n",
"
1
\n",
"
\n",
"
\n",
"
3
\n",
"
7
\n",
"
2
\n",
"
7
\n",
"
4
\n",
"
7
\n",
"
2
\n",
"
6
\n",
"
1
\n",
"
7
\n",
"
1
\n",
"
\n",
"
\n",
"
4
\n",
"
2
\n",
"
5
\n",
"
4
\n",
"
3
\n",
"
7
\n",
"
6
\n",
"
5
\n",
"
5
\n",
"
6
\n",
"
2
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TIPI1 TIPI2 TIPI3 TIPI4 TIPI5 TIPI6 TIPI7 TIPI8 TIPI9 TIPI10\n",
"0 1 1 7 1 6 7 7 1 7 1\n",
"1 5 3 5 4 5 5 5 5 6 1\n",
"2 1 6 3 7 6 7 2 6 2 1\n",
"3 7 2 7 4 7 2 6 1 7 1\n",
"4 2 5 4 3 7 6 5 5 6 2"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read in nerdiness data, keeping only the columns with 'TIPI' in it\n",
"df = pd.read_csv('data_nerd.csv', sep='\\t', usecols=lambda x: 'TIPI' in x) # A little shortcut to filter quickly!\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "3f230528-2a44-4f2c-a5cd-2b8d436e65e6",
"metadata": {},
"source": [
"The questions are as follows:\n",
"\n",
"- TIPI1\t- Extraverted, enthusiastic.\n",
"- TIPI2\t- Critical, quarrelsome.\n",
"- TIPI3\t- Dependable, self-disciplined.\n",
"- TIPI4\t- Anxious, easily upset.\n",
"- TIPI5\t- Open to new experiences, complex.\n",
"- TIPI6\t- Reserved, quiet.\n",
"- TIPI7\t- Sympathetic, warm.\n",
"- TIPI8\t- Disorganized, careless.\n",
"- TIPI9\t- Calm, emotionally stable.\n",
"- TIPI10 - Conventional, uncreative.\n",
"\n",
"Now what? Imagine we've already collected a previous sample from this dataset, and an EFA suggests the presence of 5 latent factors (it should do - its measuring the Big 5!). How might we now test whether this is correct or not?\n",
"\n",
"We specify a model in which there are 5 latent factors, and **only** responses to certain questions load on them. Specifically we might say something like this, in plain English:\n",
"\n",
"- \"The Extraversion latent variable comes from TIPI1 and TIPI6\"\n",
"- \"The Agreeableness latent variable comes from TIPI2 and TIPI7\"\n",
"- \"The Conscientiousness latent variable comes from TIPI3 and TIPI8\"\n",
"- \"The Neuroticism latent variable comes from TIPI4 and TIPI9\"\n",
"- \"The Openness latent variable comes from TIPI5 and TIPI10\"\n",
"\n",
"Indeed, those are the questions the authors of the scale posit measure those traits. How much does the data support that kind of model? What we're saying here is that there are 5 latent factors, and they load ONLY on the questions we say they do. Is that a good model? \n",
"\n",
"Let's specify this in `semopy`.\n",
"\n",
"#### CFA in `semopy`\n",
"We first do this by laying out a long formula string, which we do in triple-quotations, which allows to write long paragraphs of text as a string variable. Notice we are writing linear models as before, but we use a new symbol - `=~` - which means \"the thing on the left is a latent variable that gives rise to the things on the right\". Notice too we can name the latent variables with whatever name we like:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "58b7b266-0a7f-4271-86a7-ba71e59e9d21",
"metadata": {},
"outputs": [],
"source": [
"# Create a model string for CFA!\n",
"cfa_model = \"\"\"\n",
"extraversion =~ TIPI1 + TIPI6\n",
"agreeableness =~ TIPI2 + TIPI7\n",
"conscientious =~ TIPI3 + TIPI8\n",
"neuroticism =~ TIPI4 + TIPI9\n",
"openness =~ TIPI5 + TIPI10\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "31385830-8438-468c-9432-5d4b373b559b",
"metadata": {},
"source": [
"Once we have that, we pass it into `semopy` which has a `Model` function, like so:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3fd27397-7183-470e-9e87-64b82c6af428",
"metadata": {},
"outputs": [],
"source": [
"# Create a model from the string\n",
"model = sem.Model(cfa_model)"
]
},
{
"cell_type": "markdown",
"id": "309e51f0-c8f9-4907-9715-61be0919e657",
"metadata": {},
"source": [
"And then, in a slightly different way to `statsmodels`, we fit it to the dataset like this:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "eed7bb4b-34e5-4f77-8c59-ba7945902f72",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SolverResult(fun=0.2655352957417101, success=True, n_it=64, x=array([-5.07648299e-01, -3.35199277e+00, -5.49388913e-01, -1.60867351e+00,\n",
" -2.48185202e-01, 1.87442078e-16, 2.60000298e+00, 3.23650506e+00,\n",
" 4.19420084e-01, 2.58313761e+00, 0.00000000e+00, 2.14705944e+00,\n",
" 0.00000000e+00, 3.06230987e+00, 1.96602959e-01, 2.62027984e-01,\n",
" -8.79052543e-02, -2.40102364e-01, 2.26342724e-02, -9.82637531e-02,\n",
" 2.59133618e+00, 1.86467945e-01, -5.75723949e-01, 1.99518616e-01,\n",
" 3.89764596e+00, -2.52805964e-01, 1.29393368e+00, 2.00616766e+00,\n",
" 7.09319920e-01, -2.74134742e-01]), message='Optimization terminated successfully', name_method='SLSQP', name_obj='MLW')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit it to the data, observing its successful print message\n",
"model.fit(df)"
]
},
{
"cell_type": "markdown",
"id": "c4c2eb10-48a7-416a-b6f6-77f4813c573a",
"metadata": {},
"source": [
"While that is easy enough, we now turn to the difficulty of interpreting these models."
]
},
{
"cell_type": "markdown",
"id": "6e815470-9f95-4270-a1fe-3b3e52b594f5",
"metadata": {},
"source": [
"### Interpreting CFA models - coefficient estimates\n",
"First, we might want to check whether the coefficients make any sense in our model. After all, we specified the model similarly to how we have done before, so it makes sense to see how each variable is associated. This is akin to checking the loading matrices in EFA - we are seeing how the observed variables are associated with the latent variables we specified. We can access this like so, by **inspecting** the model:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3c069a89-00a6-41ab-aebd-1209d0e7e53e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" lval op rval Estimate Est. Std Std. Err z-value p-value\n",
"0 TIPI1 ~ extraversion 1.000000 1.000000 - - -\n",
"1 TIPI6 ~ extraversion -0.507648 -0.564553 0.014047 -36.139736 0.0\n",
"2 TIPI2 ~ agreeableness 1.000000 0.273672 - - -\n",
"3 TIPI7 ~ agreeableness -3.351993 -1.000000 0.281172 -11.921488 0.0\n",
"4 TIPI3 ~ conscientious 1.000000 0.927735 - - -\n",
"5 TIPI8 ~ conscientious -0.549389 -0.451050 0.02174 -25.271048 0.0\n",
"6 TIPI4 ~ neuroticism 1.000000 0.577702 - - -\n",
"7 TIPI9 ~ neuroticism -1.608674 -0.971876 0.043491 -36.988927 0.0\n",
"8 TIPI5 ~ openness 1.000000 1.000000 - - -\n",
"9 TIPI10 ~ openness -0.248185 -0.213005 0.023915 -10.377881 0.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the TIPI-factor associations\n",
"coefs.loc[0:9]"
]
},
{
"cell_type": "markdown",
"id": "7937ef86-cd1c-413e-ad6e-b1013cb47453",
"metadata": {},
"source": [
"These show the loadings of each observed variable onto the factor. Why are some of them fixed to 1? This is a quirk of the way these models are estimated. The first variable associated with the factor gets its parameter fixed to 1 so as to set a unit or a scale for the latent variable. For example, consider extraversion. The TIPI6 slope is negative. Why? Because TIPI6 asks 'are you reserved, quiet?', and TIPI1 - which sets the scale - asks \"are you extraverted, enthusiastic?\". Naturally it makes sense for a negative association to appear. Nonetheless, the `Est.Std` column reveals the correlation between each variable and its latent factor, if it can be estimated. The rest of the table shows the variances and covariances of the predicted data - that is, anything that is indicated by a `~~` symbol indicates the variance/covariance between the left and right hand side variable names:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a266ea49-ea22-4560-a951-1332ed0dddb3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
lval
\n",
"
op
\n",
"
rval
\n",
"
Estimate
\n",
"
Est. Std
\n",
"
Std. Err
\n",
"
z-value
\n",
"
p-value
\n",
"
\n",
" \n",
" \n",
"
\n",
"
10
\n",
"
agreeableness
\n",
"
~~
\n",
"
agreeableness
\n",
"
0.26
\n",
"
1.00
\n",
"
0.024277
\n",
"
10.793053
\n",
"
0.0
\n",
"
\n",
"
\n",
"
11
\n",
"
agreeableness
\n",
"
~~
\n",
"
conscientious
\n",
"
-0.09
\n",
"
-0.11
\n",
"
0.009263
\n",
"
-9.490274
\n",
"
0.0
\n",
"
\n",
"
\n",
"
12
\n",
"
agreeableness
\n",
"
~~
\n",
"
extraversion
\n",
"
-0.24
\n",
"
-0.24
\n",
"
0.021176
\n",
"
-11.338675
\n",
"
0.0
\n",
"
\n",
"
\n",
"
13
\n",
"
agreeableness
\n",
"
~~
\n",
"
neuroticism
\n",
"
0.02
\n",
"
0.04
\n",
"
0.004264
\n",
"
5.307691
\n",
"
0.0
\n",
"
\n",
"
\n",
"
14
\n",
"
agreeableness
\n",
"
~~
\n",
"
openness
\n",
"
-0.10
\n",
"
-0.14
\n",
"
0.009443
\n",
"
-10.406511
\n",
"
0.0
\n",
"
\n",
"
\n",
"
15
\n",
"
conscientious
\n",
"
~~
\n",
"
conscientious
\n",
"
2.59
\n",
"
1.00
\n",
"
0.101503
\n",
"
25.5296
\n",
"
0.0
\n",
"
\n",
"
\n",
"
16
\n",
"
conscientious
\n",
"
~~
\n",
"
extraversion
\n",
"
0.19
\n",
"
0.06
\n",
"
0.021543
\n",
"
8.655804
\n",
"
0.0
\n",
"
\n",
"
\n",
"
17
\n",
"
conscientious
\n",
"
~~
\n",
"
neuroticism
\n",
"
-0.58
\n",
"
-0.31
\n",
"
0.020122
\n",
"
-28.611052
\n",
"
0.0
\n",
"
\n",
"
\n",
"
18
\n",
"
conscientious
\n",
"
~~
\n",
"
openness
\n",
"
0.20
\n",
"
0.09
\n",
"
0.015485
\n",
"
12.884486
\n",
"
0.0
\n",
"
\n",
"
\n",
"
19
\n",
"
extraversion
\n",
"
~~
\n",
"
extraversion
\n",
"
3.90
\n",
"
1.00
\n",
"
0.107465
\n",
"
36.269113
\n",
"
0.0
\n",
"
\n",
"
\n",
"
20
\n",
"
extraversion
\n",
"
~~
\n",
"
neuroticism
\n",
"
-0.25
\n",
"
-0.11
\n",
"
0.016061
\n",
"
-15.740773
\n",
"
0.0
\n",
"
\n",
"
\n",
"
21
\n",
"
neuroticism
\n",
"
~~
\n",
"
neuroticism
\n",
"
1.29
\n",
"
1.00
\n",
"
0.041927
\n",
"
30.861919
\n",
"
0.0
\n",
"
\n",
"
\n",
"
22
\n",
"
openness
\n",
"
~~
\n",
"
openness
\n",
"
2.01
\n",
"
1.00
\n",
"
0.185288
\n",
"
10.827292
\n",
"
0.0
\n",
"
\n",
"
\n",
"
23
\n",
"
openness
\n",
"
~~
\n",
"
extraversion
\n",
"
0.71
\n",
"
0.25
\n",
"
0.018164
\n",
"
39.051735
\n",
"
0.0
\n",
"
\n",
"
\n",
"
24
\n",
"
openness
\n",
"
~~
\n",
"
neuroticism
\n",
"
-0.27
\n",
"
-0.17
\n",
"
0.01279
\n",
"
-21.433455
\n",
"
0.0
\n",
"
\n",
"
\n",
"
25
\n",
"
TIPI1
\n",
"
~~
\n",
"
TIPI1
\n",
"
0.00
\n",
"
0.00
\n",
"
0.101706
\n",
"
0.0
\n",
"
1.0
\n",
"
\n",
"
\n",
"
26
\n",
"
TIPI10
\n",
"
~~
\n",
"
TIPI10
\n",
"
2.60
\n",
"
0.95
\n",
"
0.025788
\n",
"
100.823661
\n",
"
0.0
\n",
"
\n",
"
\n",
"
27
\n",
"
TIPI2
\n",
"
~~
\n",
"
TIPI2
\n",
"
3.24
\n",
"
0.93
\n",
"
0.035777
\n",
"
90.464184
\n",
"
0.0
\n",
"
\n",
"
\n",
"
28
\n",
"
TIPI3
\n",
"
~~
\n",
"
TIPI3
\n",
"
0.42
\n",
"
0.14
\n",
"
0.098041
\n",
"
4.277989
\n",
"
0.000019
\n",
"
\n",
"
\n",
"
29
\n",
"
TIPI4
\n",
"
~~
\n",
"
TIPI4
\n",
"
2.58
\n",
"
0.67
\n",
"
0.0403
\n",
"
64.097007
\n",
"
0.0
\n",
"
\n",
"
\n",
"
30
\n",
"
TIPI5
\n",
"
~~
\n",
"
TIPI5
\n",
"
0.00
\n",
"
0.00
\n",
"
0.184425
\n",
"
0.0
\n",
"
1.0
\n",
"
\n",
"
\n",
"
31
\n",
"
TIPI6
\n",
"
~~
\n",
"
TIPI6
\n",
"
2.15
\n",
"
0.68
\n",
"
0.032442
\n",
"
66.181825
\n",
"
0.0
\n",
"
\n",
"
\n",
"
32
\n",
"
TIPI7
\n",
"
~~
\n",
"
TIPI7
\n",
"
0.00
\n",
"
0.00
\n",
"
0.23821
\n",
"
0.0
\n",
"
1.0
\n",
"
\n",
"
\n",
"
33
\n",
"
TIPI8
\n",
"
~~
\n",
"
TIPI8
\n",
"
3.06
\n",
"
0.80
\n",
"
0.040223
\n",
"
76.133203
\n",
"
0.0
\n",
"
\n",
"
\n",
"
34
\n",
"
TIPI9
\n",
"
~~
\n",
"
TIPI9
\n",
"
0.20
\n",
"
0.06
\n",
"
0.085655
\n",
"
2.295285
\n",
"
0.021717
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" lval op rval Estimate Est. Std Std. Err \\\n",
"10 agreeableness ~~ agreeableness 0.26 1.00 0.024277 \n",
"11 agreeableness ~~ conscientious -0.09 -0.11 0.009263 \n",
"12 agreeableness ~~ extraversion -0.24 -0.24 0.021176 \n",
"13 agreeableness ~~ neuroticism 0.02 0.04 0.004264 \n",
"14 agreeableness ~~ openness -0.10 -0.14 0.009443 \n",
"15 conscientious ~~ conscientious 2.59 1.00 0.101503 \n",
"16 conscientious ~~ extraversion 0.19 0.06 0.021543 \n",
"17 conscientious ~~ neuroticism -0.58 -0.31 0.020122 \n",
"18 conscientious ~~ openness 0.20 0.09 0.015485 \n",
"19 extraversion ~~ extraversion 3.90 1.00 0.107465 \n",
"20 extraversion ~~ neuroticism -0.25 -0.11 0.016061 \n",
"21 neuroticism ~~ neuroticism 1.29 1.00 0.041927 \n",
"22 openness ~~ openness 2.01 1.00 0.185288 \n",
"23 openness ~~ extraversion 0.71 0.25 0.018164 \n",
"24 openness ~~ neuroticism -0.27 -0.17 0.01279 \n",
"25 TIPI1 ~~ TIPI1 0.00 0.00 0.101706 \n",
"26 TIPI10 ~~ TIPI10 2.60 0.95 0.025788 \n",
"27 TIPI2 ~~ TIPI2 3.24 0.93 0.035777 \n",
"28 TIPI3 ~~ TIPI3 0.42 0.14 0.098041 \n",
"29 TIPI4 ~~ TIPI4 2.58 0.67 0.0403 \n",
"30 TIPI5 ~~ TIPI5 0.00 0.00 0.184425 \n",
"31 TIPI6 ~~ TIPI6 2.15 0.68 0.032442 \n",
"32 TIPI7 ~~ TIPI7 0.00 0.00 0.23821 \n",
"33 TIPI8 ~~ TIPI8 3.06 0.80 0.040223 \n",
"34 TIPI9 ~~ TIPI9 0.20 0.06 0.085655 \n",
"\n",
" z-value p-value \n",
"10 10.793053 0.0 \n",
"11 -9.490274 0.0 \n",
"12 -11.338675 0.0 \n",
"13 5.307691 0.0 \n",
"14 -10.406511 0.0 \n",
"15 25.5296 0.0 \n",
"16 8.655804 0.0 \n",
"17 -28.611052 0.0 \n",
"18 12.884486 0.0 \n",
"19 36.269113 0.0 \n",
"20 -15.740773 0.0 \n",
"21 30.861919 0.0 \n",
"22 10.827292 0.0 \n",
"23 39.051735 0.0 \n",
"24 -21.433455 0.0 \n",
"25 0.0 1.0 \n",
"26 100.823661 0.0 \n",
"27 90.464184 0.0 \n",
"28 4.277989 0.000019 \n",
"29 64.097007 0.0 \n",
"30 0.0 1.0 \n",
"31 66.181825 0.0 \n",
"32 0.0 1.0 \n",
"33 76.133203 0.0 \n",
"34 2.295285 0.021717 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get variance/covariance\n",
"coefs[coefs['op'].str.contains('~~')].round(2)"
]
},
{
"cell_type": "markdown",
"id": "5aaf549b-9a50-45b2-996d-ce99a4f5002f",
"metadata": {},
"source": [
"We can also visualise our model structure by using the `semplot` function, for which we can save our structure to a file to put in a paper, talk, etc:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5736ea72-6ae3-43ad-a52b-f85e30dd33ef",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show and save\n",
"sem.semplot(model, filename='test.png', std_ests=True)"
]
},
{
"cell_type": "markdown",
"id": "a9791b6e-9b9c-40cb-9edd-5df8e951cf1a",
"metadata": {},
"source": [
"### Interpreting CFA models - fit statistics\n",
"While standard GLM's can be interpreted in terms of the quality of their predictions, it is harder to do this with CFA models. CFA models produce a range of 'fit statistics' which describe how well the model we suppose fits the data. We obtain these with the `sem.calc_stats` function on the model, like this:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "014d0c6c-7cc9-4df3-bd0d-4be52d2f9e17",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Value
\n",
"
\n",
" \n",
" \n",
"
\n",
"
DoF
\n",
"
25.000000
\n",
"
\n",
"
\n",
"
DoF Baseline
\n",
"
45.000000
\n",
"
\n",
"
\n",
"
chi2
\n",
"
6698.393370
\n",
"
\n",
"
\n",
"
chi2 p-value
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
chi2 Baseline
\n",
"
40411.592111
\n",
"
\n",
"
\n",
"
CFI
\n",
"
0.834680
\n",
"
\n",
"
\n",
"
GFI
\n",
"
0.834246
\n",
"
\n",
"
\n",
"
AGFI
\n",
"
0.701642
\n",
"
\n",
"
\n",
"
NFI
\n",
"
0.834246
\n",
"
\n",
"
\n",
"
TLI
\n",
"
0.702425
\n",
"
\n",
"
\n",
"
RMSEA
\n",
"
0.102870
\n",
"
\n",
"
\n",
"
AIC
\n",
"
59.468929
\n",
"
\n",
"
\n",
"
BIC
\n",
"
303.537844
\n",
"
\n",
"
\n",
"
LogLik
\n",
"
0.265535
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Value\n",
"DoF 25.000000\n",
"DoF Baseline 45.000000\n",
"chi2 6698.393370\n",
"chi2 p-value 0.000000\n",
"chi2 Baseline 40411.592111\n",
"CFI 0.834680\n",
"GFI 0.834246\n",
"AGFI 0.701642\n",
"NFI 0.834246\n",
"TLI 0.702425\n",
"RMSEA 0.102870\n",
"AIC 59.468929\n",
"BIC 303.537844\n",
"LogLik 0.265535"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Obtain fit statistics\n",
"fit_stats = sem.calc_stats(model)\n",
"display(fit_stats.T)"
]
},
{
"cell_type": "markdown",
"id": "29160b8d-d896-4223-8cc5-b66df676962c",
"metadata": {},
"source": [
"This array of statistics is confusing, but they all exist in some form to help us interpret how well the model fits the data. Here we highlight the ones that are worth paying attention to, and offer some rules-of-thumb in how to interpret them.\n",
"\n",
"- **DOF, DOF Baseline, chi2, and chi2 p-value** These values tell us whether the model we specify is satisfactorily reproducing the data that goes into it. Specifically, it is looking for the differences between the correlation matrix of all the variables in our observed data, and the correlation matrix of the predictions from the model. If that is large - i.e., the model is doing a bad job of replicating the data - then this test will be *significant*, and this *is a bad thing*. Confusing, I know! This test is however very sensitive to sample size, with larger samples like we have here producing significant results even if the discrepancy is quite small.\n",
"- **CFI** This is the *Comparative Fit Index*, which compares the fit of the model we specified to a null model that has no associations amongst variables. This has an established cut-off in the literature - it should be **greater than .90** for a 'good' fit.\n",
"- **GFI and AGFI** These are the *(Adjusted) Goodness of Fit* index, which are the CFA equivalent of $R^2$, or how much variance the model can explain in the data that goes into it. These too have cutoffs, and should be **greater than .90 and .95 respectively**.\n",
"- **NFI and TLI** The Normed Fit index, and its more conservative version, the Tucker-Lewis Index, are another way to check the discrepancy of the model against a null model by focusing on the chi-square statistic. Again, there is a strong suggestion these values **should be above .95**.\n",
"- **RMSEA** The root-mean-square-error of-approximation tells us - much like the RMSE in OLS - what the average amount of error the model gives. As such, a low value is better, and there is also a cutoff here - **it should be less than .06**.\n",
"\n",
"Taken together (and its a lot to take!), this suggests this model is *not a good fit*. In other words, the data do not support the proposed latent factor structure we have suggested. In this case, the standard advice would be to carry out an EFA and see what is going on."
]
},
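{
"cell_type": "markdown",
"id": "b3e2a1d0-1f2a-4c3b-9d4e-5f6a7b8c9d0e",
"metadata": {},
"source": [
"As a quick sanity check, we can pull the headline indices out of the `fit_stats` dataframe and compare them to the rules-of-thumb above. This is a minimal sketch - it assumes the `fit_stats` dataframe from the cell above, and checks just three of the indices:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4f3b2e1-2a3b-4d4c-8e5f-6a7b8c9d0e1f",
"metadata": {},
"outputs": [],
"source": [
"# Compare the key fit indices against the rule-of-thumb cutoffs above\n",
"checks = {'CFI > .90': fit_stats['CFI'].iloc[0] > .90,\n",
"          'TLI > .95': fit_stats['TLI'].iloc[0] > .95,\n",
"          'RMSEA < .06': fit_stats['RMSEA'].iloc[0] < .06}\n",
"\n",
"for rule, passed in checks.items():\n",
"    print(rule, '-', 'pass' if passed else 'fail')"
]
},
{
"cell_type": "markdown",
"id": "d5a4c3f2-3b4c-4e5d-9f6a-7b8c9d0e1f2a",
"metadata": {},
"source": [
"And, following that standard advice, here is a quick EFA sketch on the same data using the `FactorAnalyzer` we imported at the start. The choice of 5 factors and an oblique rotation are assumptions made here for comparability with our CFA - the loadings show what structure the data actually support:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6b5d4a3-4c5d-4f6e-8a7b-8c9d0e1f2a3b",
"metadata": {},
"outputs": [],
"source": [
"# Exploratory follow-up - let an EFA find the structure the data support\n",
"efa = FactorAnalyzer(n_factors=5, rotation='oblimin')\n",
"efa.fit(df.dropna())  # FactorAnalyzer cannot handle missing values\n",
"\n",
"# Loadings of each TIPI item on the five extracted factors\n",
"pd.DataFrame(efa.loadings_, index=df.columns).round(2)"
]
},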
{
"cell_type": "markdown",
"id": "cb18be33-0387-4c77-8d00-740c088f68e7",
"metadata": {},
"source": [
"### Another example - CFA models meet regression\n",
"Here, we demonstrate how CFA can be used to map observed variables to latent variables, and then see if those latent variables can predict another variable of interest. This sort of CFA - plus - regression approach technically falls under the SEM approach, but it is a useful demonstration of how all of these techniques are connected.\n",
"\n",
"To demonstrate, we'll use a dataset that comes with `semopy`, the Holzinger-Swineford -39 dataset. This classic dataset has a set of scores for seventh and eighth grade students on a a series of mental ability tests (in the dataset, they are `x1` through to `x9`). Interestingly, these tests are supposed to capture three specific latent aspects of intelligence/mental ability - visual, textual, and processing speed.\n",
"\n",
"First let us load the data and examine it:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "8324ccb9-757f-4315-a299-4389acb78ed0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
id
\n",
"
sex
\n",
"
ageyr
\n",
"
agemo
\n",
"
school
\n",
"
grade
\n",
"
x1
\n",
"
x2
\n",
"
x3
\n",
"
x4
\n",
"
x5
\n",
"
x6
\n",
"
x7
\n",
"
x8
\n",
"
x9
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1
\n",
"
1
\n",
"
1
\n",
"
13
\n",
"
1
\n",
"
Pasteur
\n",
"
7.0
\n",
"
3.333333
\n",
"
7.75
\n",
"
0.375
\n",
"
2.333333
\n",
"
5.75
\n",
"
1.285714
\n",
"
3.391304
\n",
"
5.75
\n",
"
6.361111
\n",
"
\n",
"
\n",
"
2
\n",
"
2
\n",
"
0
\n",
"
13
\n",
"
7
\n",
"
Pasteur
\n",
"
7.0
\n",
"
5.333333
\n",
"
5.25
\n",
"
2.125
\n",
"
1.666667
\n",
"
3.00
\n",
"
1.285714
\n",
"
3.782609
\n",
"
6.25
\n",
"
7.916667
\n",
"
\n",
"
\n",
"
3
\n",
"
3
\n",
"
0
\n",
"
13
\n",
"
1
\n",
"
Pasteur
\n",
"
7.0
\n",
"
4.500000
\n",
"
5.25
\n",
"
1.875
\n",
"
1.000000
\n",
"
1.75
\n",
"
0.428571
\n",
"
3.260870
\n",
"
3.90
\n",
"
4.416667
\n",
"
\n",
"
\n",
"
4
\n",
"
4
\n",
"
1
\n",
"
13
\n",
"
2
\n",
"
Pasteur
\n",
"
7.0
\n",
"
5.333333
\n",
"
7.75
\n",
"
3.000
\n",
"
2.666667
\n",
"
4.50
\n",
"
2.428571
\n",
"
3.000000
\n",
"
5.30
\n",
"
4.861111
\n",
"
\n",
"
\n",
"
5
\n",
"
5
\n",
"
0
\n",
"
12
\n",
"
2
\n",
"
Pasteur
\n",
"
7.0
\n",
"
4.833333
\n",
"
4.75
\n",
"
0.875
\n",
"
2.666667
\n",
"
4.00
\n",
"
2.571429
\n",
"
3.695652
\n",
"
6.30
\n",
"
5.916667
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id sex ageyr agemo school grade x1 x2 x3 x4 \\\n",
"1 1 1 13 1 Pasteur 7.0 3.333333 7.75 0.375 2.333333 \n",
"2 2 0 13 7 Pasteur 7.0 5.333333 5.25 2.125 1.666667 \n",
"3 3 0 13 1 Pasteur 7.0 4.500000 5.25 1.875 1.000000 \n",
"4 4 1 13 2 Pasteur 7.0 5.333333 7.75 3.000 2.666667 \n",
"5 5 0 12 2 Pasteur 7.0 4.833333 4.75 0.875 2.666667 \n",
"\n",
" x5 x6 x7 x8 x9 \n",
"1 5.75 1.285714 3.391304 5.75 6.361111 \n",
"2 3.00 1.285714 3.782609 6.25 7.916667 \n",
"3 1.75 0.428571 3.260870 3.90 4.416667 \n",
"4 4.50 2.428571 3.000000 5.30 4.861111 \n",
"5 4.00 2.571429 3.695652 6.30 5.916667 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pull the data from the examples\n",
"hlz = sem.examples.holzinger39.get_data().dropna(how='any').replace({'sex': {2: 0}})\n",
"hlz.head()"
]
},
{
"cell_type": "markdown",
"id": "ce7d8b20-6950-4ca1-ac8e-ef13a3dd6029",
"metadata": {},
"source": [
"The variables x1 through to x9 represent the tests.\n",
"- `x1`, `x2`, `x3` represent visual ability\n",
"- `x4`, `x5`, `x6` represent text ability\n",
"- `x7`, `x8`, `x9` represent processing speed.\n",
"- `sex` represents the sex of the child.\n",
"\n",
"Imagine we wanted to predict the sex of a child given scores on the variables - i.e., could we classify a child's sex given their test scores? With only the observed variables, we would have to run a model like this (ideally a logistic model, but we can use a GLM as an example):\n"
]
},
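{
"cell_type": "code",
"execution_count": null,
"id": "f7c6e5b4-5d6e-4a7f-9b8c-9d0e1f2a3b4c",
"metadata": {},
"outputs": [],
"source": [
"# A sketch of the observed-variable approach: regress sex on all nine scores\n",
"# (sex is coded 0/1 here, so smf.logit would be the more principled choice)\n",
"glm = smf.ols('sex ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9', data=hlz).fit()\n",
"glm.params.round(3)"
]
},
{
"cell_type": "markdown",
"id": "a8d7f6c5-6e7f-4b8a-8c9d-0e1f2a3b4c5d",
"metadata": {},
"source": [
"That works, but it throws away the latent structure entirely. Instead, let's specify the three-factor CFA, fit it, and check its fit statistics:"
]
},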
{
"cell_type": "code",
"execution_count": 12,
"id": "19776fb4-f992-4821-bf1c-450efb7d40b7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" Value\n",
"DoF 24.00\n",
"DoF Baseline 36.00\n",
"chi2 85.94\n",
"chi2 p-value 0.00\n",
"chi2 Baseline 918.20\n",
"CFI 0.93\n",
"GFI 0.91\n",
"AGFI 0.86\n",
"NFI 0.91\n",
"TLI 0.89\n",
"RMSEA 0.09\n",
"AIC 41.43\n",
"BIC 119.21\n",
"LogLik 0.29"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get fit\n",
"sem.calc_stats(model).T.round(2)"
]
},
{
"cell_type": "markdown",
"id": "32ab799c-edc9-46f2-8ba7-20b490ce1e13",
"metadata": {},
"source": [
"Not *too* bad overall, though a mix of different statistics that might suggest that the model isn't a solid fit, but it hints at the fact the three latent variables underpin the scores. We can now expand our CFA model to let the latent variables actually predict the sex of the child, all in one step, like so:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "54371eb6-4a38-4c5b-a5e6-fd0b04f9e9f4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
lval
\n",
"
op
\n",
"
rval
\n",
"
Estimate
\n",
"
Est. Std
\n",
"
Std. Err
\n",
"
z-value
\n",
"
p-value
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
x1
\n",
"
~
\n",
"
visual
\n",
"
1.000000
\n",
"
0.740218
\n",
"
-
\n",
"
-
\n",
"
-
\n",
"
\n",
"
\n",
"
1
\n",
"
x2
\n",
"
~
\n",
"
visual
\n",
"
0.600939
\n",
"
0.440813
\n",
"
0.102441
\n",
"
5.866185
\n",
"
0.0
\n",
"
\n",
"
\n",
"
2
\n",
"
x3
\n",
"
~
\n",
"
visual
\n",
"
0.796922
\n",
"
0.609516
\n",
"
0.11097
\n",
"
7.181421
\n",
"
0.0
\n",
"
\n",
"
\n",
"
3
\n",
"
x4
\n",
"
~
\n",
"
text
\n",
"
1.000000
\n",
"
0.855219
\n",
"
-
\n",
"
-
\n",
"
-
\n",
"
\n",
"
\n",
"
4
\n",
"
x5
\n",
"
~
\n",
"
text
\n",
"
1.104548
\n",
"
0.853538
\n",
"
0.064873
\n",
"
17.026223
\n",
"
0.0
\n",
"
\n",
"
\n",
"
5
\n",
"
x6
\n",
"
~
\n",
"
text
\n",
"
0.918146
\n",
"
0.835135
\n",
"
0.055047
\n",
"
16.679216
\n",
"
0.0
\n",
"
\n",
"
\n",
"
6
\n",
"
x7
\n",
"
~
\n",
"
speed
\n",
"
1.000000
\n",
"
0.565775
\n",
"
-
\n",
"
-
\n",
"
-
\n",
"
\n",
"
\n",
"
7
\n",
"
x8
\n",
"
~
\n",
"
speed
\n",
"
1.158891
\n",
"
0.707608
\n",
"
0.162125
\n",
"
7.148138
\n",
"
0.0
\n",
"
\n",
"
\n",
"
8
\n",
"
x9
\n",
"
~
\n",
"
speed
\n",
"
1.116640
\n",
"
0.682020
\n",
"
0.156292
\n",
"
7.144581
\n",
"
0.0
\n",
"
\n",
"
\n",
"
9
\n",
"
sex
\n",
"
~
\n",
"
visual
\n",
"
0.200401
\n",
"
0.346385
\n",
"
0.060752
\n",
"
3.298669
\n",
"
0.000971
\n",
"
\n",
"
\n",
"
10
\n",
"
sex
\n",
"
~
\n",
"
text
\n",
"
-0.096535
\n",
"
-0.192281
\n",
"
0.037118
\n",
"
-2.600795
\n",
"
0.009301
\n",
"
\n",
"
\n",
"
11
\n",
"
sex
\n",
"
~
\n",
"
speed
\n",
"
-0.126927
\n",
"
-0.156546
\n",
"
0.071654
\n",
"
-1.771402
\n",
"
0.076494
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" lval op rval Estimate Est. Std Std. Err z-value p-value\n",
"0 x1 ~ visual 1.000000 0.740218 - - -\n",
"1 x2 ~ visual 0.600939 0.440813 0.102441 5.866185 0.0\n",
"2 x3 ~ visual 0.796922 0.609516 0.11097 7.181421 0.0\n",
"3 x4 ~ text 1.000000 0.855219 - - -\n",
"4 x5 ~ text 1.104548 0.853538 0.064873 17.026223 0.0\n",
"5 x6 ~ text 0.918146 0.835135 0.055047 16.679216 0.0\n",
"6 x7 ~ speed 1.000000 0.565775 - - -\n",
"7 x8 ~ speed 1.158891 0.707608 0.162125 7.148138 0.0\n",
"8 x9 ~ speed 1.116640 0.682020 0.156292 7.144581 0.0\n",
"9 sex ~ visual 0.200401 0.346385 0.060752 3.298669 0.000971\n",
"10 sex ~ text -0.096535 -0.192281 0.037118 -2.600795 0.009301\n",
"11 sex ~ speed -0.126927 -0.156546 0.071654 -1.771402 0.076494"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Identify latent factors\n",
"cfa_model = \"\"\"\n",
"# Specifies the latent factors\n",
"visual =~ x1 + x2 + x3\n",
"text =~ x4 + x5 + x6\n",
"speed =~ x7 + x8 + x9\n",
"\n",
"# Uses them to predict sex\n",
"sex ~ visual + text + speed\n",
"\"\"\"\n",
"\n",
"# Model\n",
"model = sem.Model(cfa_model)\n",
"model.fit(hlz)\n",
"\n",
"# Inspect\n",
"model.inspect(std_est=True).query('op == \"~\"')"
]
},
{
"cell_type": "markdown",
"id": "3bfbde4b-6019-4d29-b821-d751acbbc77d",
"metadata": {},
"source": [
"We can see here that the visual and text latent scores are negative predictors, e.g. as latent scores go up, the likelihood of the scores belonging to one sex or another decreases.\n",
"\n",
"As is hopefully clear, the idea of CFA is quite general - it allows us to specify a set of latent variables, and link them to specific observed variables, testing the goodness of fit of that arrangement. If we want to, we can further expand these models to use the latent variables to predict other variables, just like a standard GLM."
]
}
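,
{
"cell_type": "markdown",
"id": "b9e8a7d6-7f8a-4c9b-9d0e-1f2a3b4c5d6e",
"metadata": {},
"source": [
"As with the pure CFA, it is worth checking the fit statistics of this expanded model too. The same `calc_stats` call works unchanged - a quick sketch, assuming the fitted `model` from the cell above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0f9b8e7-8a9b-4d0c-8e1f-2a3b4c5d6e7f",
"metadata": {},
"outputs": [],
"source": [
"# Fit statistics for the CFA-plus-regression model\n",
"sem.calc_stats(model).T.round(2)"
]
}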
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}